Abstract. We consider the problem of dictionary matching in a stream.
Given a set of strings, known as a dictionary, and a stream of characters
arriving one at a time, the task is to report each time some string
in our dictionary occurs in the stream. We present a randomised
algorithm which takes O(log log(k + m)) time per arriving character and
uses O(k log m) words of space, where k is the number of strings in the
dictionary and m is the length of the longest string in the dictionary.


1 Introduction 

We consider the problem of dictionary matching in a stream. Given a set of 
strings, known as a dictionary, and a stream of characters arriving one at a 
time, the task is to determine when some string in our dictionary matches a suffix 
of the growing stream. The dictionary matching problem models the common 
situation where we are interested in not only a single pattern that may occur 
but in fact a whole set of them. 

Dictionary matching is considered one of the classic and most widely studied
problems within the field of combinatorial pattern matching. The original
solution of Aho and Corasick [1] has, for example, been cited over 2800 times.
The dictionary problem, along with its efficient solutions, also admits a very
wide range of practical applications: from searching for DNA sequences in genetic
databases [?] to intrusion detection [20] and many more. The dictionaries that
are used in these applications are often also very large as they may contain all 
strings within a neighbourhood of some seed for example, or even all strings 
in a language defined by a particular regular expression. As a result, there is a 
pressing need for methods which are not only fast but also use as little space as 
possible. 

The solutions we present will be analysed under a particularly strong model 
of space usage. We will account for all the space used by our algorithm and will 
not, for example, even allow ourselves to store a complete version of the input. 
In particular, we will neither be able to store the whole of the dictionary nor 
the streaming text. We now define the problem which will be the main object of 
study for this paper more formally. 

Problem 1. In the dictionary-matching problem we have a set of patterns 𝒫
and a streaming text T = t_1 ... t_n which arrives one character at a time. We
must report all positions in T where there exists a pattern in 𝒫 which matches
exactly. More formally, we output all the positions x such that there exists a
pattern P_i ∈ 𝒫 with t_{x-|P_i|+1} ... t_x = P_i. We must report an occurrence of some
pattern in 𝒫 as soon as it occurs and before we can process the subsequent
arriving character.

If all the patterns in the dictionary had the same length m then we could
straightforwardly deploy the fingerprinting method of Karp and Rabin [?] to
maintain a fingerprint of a window of m successive characters of the text. We can
then compare this, for each new character that arrives, to a hash table of stored
fingerprints of the patterns in the dictionary. In our notation this approach would
require O(k + m) words of space and constant time per arrival. However, if the
patterns are not all the same length this technique no longer works.
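
The equal-length case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the modulus PRIME and the use of a standard rolling polynomial hash (most significant character first) are choices made here, not details fixed by the text. The window of the last m characters and the set of k fingerprints account for the O(k + m) words.

```python
import random
from collections import deque

PRIME = (1 << 61) - 1  # assumed modulus for this sketch


class EqualLengthMatcher:
    """Streaming matcher for a dictionary whose patterns all have length m."""

    def __init__(self, patterns, r=None):
        self.m = len(patterns[0])
        if any(len(q) != self.m for q in patterns):
            raise ValueError("all patterns must share one length")
        self.r = r if r is not None else random.randrange(2, PRIME)
        self.top = pow(self.r, self.m - 1, PRIME)  # weight of the oldest character
        self.table = {self._fingerprint(q) for q in patterns}  # k fingerprints
        self.window = deque()  # the last m characters of the text
        self.h = 0             # fingerprint of the current window

    def _fingerprint(self, s):
        h = 0
        for ch in s:
            h = (h * self.r + ord(ch)) % PRIME
        return h

    def feed(self, ch):
        """Process one arrival; return True if some pattern ends here."""
        if len(self.window) == self.m:
            old = self.window.popleft()
            self.h = (self.h - ord(old) * self.top) % PRIME
        self.window.append(ch)
        self.h = (self.h * self.r + ord(ch)) % PRIME
        return len(self.window) == self.m and self.h in self.table
```

Each call to feed does a constant number of arithmetic operations and one hash lookup. A reported match can be a false positive if two distinct strings collide under the chosen r.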

For a single pattern, Porat and Porat [?] showed that it is possible to perform
exact matching in a stream quickly using very little space. To do this
they introduced a clever combination of the randomised fingerprinting method
of Karp and Rabin and the classical deterministic KMP algorithm [?].
Their method uses O(log m) words of space and takes O(log m) time per arriving
character, where m is the length of the single pattern. Breslauer and Galil
subsequently made two improvements to this method: they sped it up
to require only O(1) time per arriving character, and they showed
that it is possible to eliminate the possibility of false negatives, which could
occur using the previous approach [3].

Our solution takes the single-pattern streaming algorithm of Breslauer and
Galil [3] as its starting point. If we were to run this algorithm independently in
parallel for each separate string in the dictionary, this would take O(k) time per
arriving character and O(k log m) words of space. Our goal in this paper is to
reduce the running time to as close to constant as possible without increasing the
working space. Achieving this presents a number of technical difficulties which
we have to overcome.

The first such hurdle is how to process patterns of different lengths efficiently.
In the method of Breslauer and Galil, prefixes of power-of-two lengths are found
until either we encounter a mismatch or a match is found for a prefix of length
at least half of the total pattern size. Exact matches for such long prefixes can
only occur rarely and so they can afford to check each one of these potential
matches to see if it can be extended to a full match of the pattern. However,
when the number of patterns is large we can no longer afford to inspect each
pattern every time a new character arrives.

Our solution breaks down the patterns in the dictionary into three cases: 
short patterns, long patterns with short periods, long patterns with long periods. 
A key conceptual innovation that we make is a method to split the patterns into 
parts in such a way that matches for all of these parts can be found and stitched 
together at exactly the time they are needed. We achieve this while minimising 
the total working space and taking only O(log log(k + m)) time per arriving
symbol.

A straightforward counting argument tells us that any randomised algorithm
with inverse polynomial probability of error requires at least Ω(k log n) bits of
space; see for example [3]. Our space requirements are therefore within a
logarithmic factor of being optimal. However, unlike the single-pattern algorithm
of Breslauer and Galil, our dictionary matching algorithm can give both false
positives and false negatives with small probability.

Throughout the rest of this paper, we will refer to the arriving text character 
as the arrival. We can now give our main new result which will be proven in the 
remaining parts of this paper. 

Theorem 1. Consider a dictionary 𝒫 of k patterns of length at most m and a
streaming text T. The streaming dictionary matching problem can be solved in
O(log log(k + m)) time per arrival and O(k log m) words of space. The probability
of error is O(1/n), where n is the length of the streaming text.


1.1 Related work 

The now standard offline solution for dictionary matching is based on the
Aho-Corasick algorithm [1]. Given a dictionary 𝒫 = {P_1, P_2, ..., P_k} and a text
T = t_1 ... t_n, let occ denote the number of matches and M denote the sum
of the lengths of the patterns in 𝒫, that is M = Σ_{i=1}^{k} |P_i|. The Aho-Corasick
algorithm finds all occurrences of elements of 𝒫 in the text T in O(M + n + occ)
time and O(M) space. Where the dictionary is large, the space required by the
Aho-Corasick approach may however be excessive.

There is now an extensive literature in the streaming model. Focusing narrowly
on results related to the streaming algorithm of Porat and Porat [?],
this has included a form of approximate matching called parameterised
matching [12], efficient algorithms for detecting periodicity in streams [?] as well as
identifying periodic trends [10]. Fast deterministic streaming algorithms have
also been given which provide guaranteed worst-case performance for a number
of different approximate pattern matching problems [?] as well as pattern
matching in multiple streams [8].

The streaming dictionary matching problem has also been considered in a 
weaker model where the algorithm is allowed to store a complete read-only copy 
of the pattern and text but only a constant number of extra words in working 
space. Breslauer, Grossi and Mignosi [4] developed a real-time string matching
algorithm in this model by building on previous work of Crochemore and
Perrin [9]. The algorithm is based on the computation of periods and critical 
factorisations allowing at the same time a forward and a backward scan of the 
text. 


1.2 Definitions 

We will make extensive use of Karp-Rabin fingerprints [?], which we now define
along with some useful properties.

Definition 1. Karp-Rabin fingerprint function φ. Let p be a prime and r a
random integer in F_p. We define the fingerprint function φ for a string
S = s_1 ... s_ℓ as:

    φ(S) = Σ_{i=1}^{ℓ} s_i r^i mod p.

The most important property is that for any two equal-length strings U and
V with U ≠ V, the probability that φ(U) = φ(V) is at most 1/n^2 if p > n^3.
We will also exploit several well-known arithmetic properties of Karp-Rabin
fingerprints, which we give in Lemma 1. All operations will be performed in the
word-RAM model with word size Θ(log n).

Lemma 1. Let U be a string of length ℓ and V another string. Then:

  - φ(UV) = φ(U) + r^ℓ φ(V) mod p,

  - φ(U) = φ(UV) - r^ℓ φ(V) mod p,

  - φ(V) = r^{-ℓ} (φ(UV) - φ(U)) mod p.
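
The three identities of Lemma 1 can be checked numerically. The sketch below assumes the fingerprint is read as φ(S) = Σ_{i=1}^{ℓ} s_i r^i mod p; the constants PRIME and R and the function names are illustrative choices, with R standing in for the random r.

```python
PRIME = (1 << 61) - 1   # assumed prime modulus p
R = 1_000_003           # fixed stand-in for the random r in F_p


def phi(s):
    """phi(S) = sum over 1-based i of s_i * R^i, taken mod PRIME."""
    h, power = 0, 1
    for ch in s:
        power = power * R % PRIME
        h = (h + ord(ch) * power) % PRIME
    return h


def concat(phi_u, phi_v, len_u):
    """phi(UV) from phi(U), phi(V) and |U|."""
    return (phi_u + pow(R, len_u, PRIME) * phi_v) % PRIME


def drop_suffix(phi_uv, phi_v, len_u):
    """phi(U) from phi(UV), phi(V) and |U|."""
    return (phi_uv - pow(R, len_u, PRIME) * phi_v) % PRIME


def drop_prefix(phi_uv, phi_u, len_u):
    """phi(V) from phi(UV), phi(U) and |U|, using the inverse power of R."""
    return (phi_uv - phi_u) * pow(R, -len_u, PRIME) % PRIME
```

Here each identity calls pow, which costs O(log ℓ) multiplications. A streaming implementation would presumably keep the required powers of r alongside the stored fingerprints so that each operation takes constant time. Negative exponents in the three-argument pow require Python 3.8 or later.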

For a non-empty string x, an integer p with 0 < p ≤ |x| is called a period of
x if x_i = x_{i+p} for all i ∈ {1, ..., |x| - p}. The period of a non-empty string
x is simply the smallest of its periods. We will also assume that all logarithms
are base 2 and are rounded to the nearest integer.
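
As a small worked example of the definition of a period, the following naive Python sketch lists all periods of a string. It assumes the reading in which |x| itself always counts as a period and every pair of positions p apart is compared; the function names are illustrative, and the quadratic running time is for exposition only.

```python
def periods(x):
    """All p with 0 < p <= len(x) such that x[i] == x[i + p] whenever both exist."""
    n = len(x)
    return [p for p in range(1, n + 1)
            if all(x[i] == x[i + p] for i in range(n - p))]


def period(x):
    """The period of a non-empty string: the smallest of its periods."""
    return periods(x)[0]
```

For instance, "abcabcab" has periods 3, 6 and 8, so its period is 3.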

We describe three algorithms: A_1 in Section 2, which handles short patterns in
the dictionary, and A_2a and A_2b in Section 3, which deal with the long patterns.
Theorem 1 is obtained by running all three algorithms simultaneously.

2 Algorithm A_1: Short patterns

Lemma 2. There exists an algorithm A_1 which solves the streaming dictionary
matching problem and runs in O(log log(k + m)) time per arrival and uses
O(k log m) space on a dictionary of k patterns whose maximum length is at most
2k log m.

For very short patterns, shorter than 2 log m, we can straightforwardly construct
an Aho-Corasick automaton [1]. To make this efficient we store a static
perfect hash table at each node to navigate the automaton. The automaton
occupies at most O(k log m) space and reports occurrences of short patterns in
constant time per arrival. From now on, we can assume that all patterns are
longer than 2 log m.
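
For the very short patterns, a textbook Aho-Corasick automaton suffices. The Python sketch below is a generic version and not the authors' implementation: a plain dict at each node stands in for the static perfect hash table, and following failure links gives amortised rather than worst-case constant time per arrival. The function names are illustrative.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto/fail/out tables; goto[node] is the per-node hash table."""
    goto, fail, out = [{}], [0], [False]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(False)
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node] = True
    queue = deque(goto[0].values())      # depth-1 nodes fail to the root
    while queue:
        u = queue.popleft()
        for ch, v in goto[u].items():
            f = fail[u]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[v] = goto[f].get(ch, 0)
            out[v] = out[v] or out[fail[v]]   # a pattern may end as a suffix
            queue.append(v)
    return goto, fail, out


def stream_matches(automaton, text):
    """Yield each 1-based position at which some pattern ends."""
    goto, fail, out = automaton
    node = 0
    for pos, ch in enumerate(text, 1):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        if out[node]:
            yield pos
```

With the dictionary {he, she, his, hers} and the text "ushers", matches are reported at positions 4 (she, he) and 6 (hers).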

Our solution splits each of the patterns, which are all now guaranteed to have
length greater than 2 log m, into two parts in multiple ways. The first part of
each splitting of the pattern we call the head and the rest we call the tail. Tails
will always have length ℓ for all ℓ s.t. log m ≤ ℓ < 2 log m. We will therefore
split each pattern into at most log m head/tail pairs, making a total of at most
k log m heads overall.
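
The splitting into head/tail pairs can be made concrete as follows. This sketch assumes the tail lengths range over log m ≤ ℓ < 2 log m, which gives exactly log m pairs per pattern; the function name and the explicit parameter m are illustrative.

```python
import math


def head_tail_pairs(pattern, m):
    """Split `pattern` once for every tail length l with log m <= l < 2 log m."""
    lg = round(math.log2(m))   # logarithms are base 2 and rounded
    if len(pattern) <= 2 * lg:
        raise ValueError("only patterns longer than 2 log m are split")
    return [(pattern[:-l], pattern[-l:]) for l in range(lg, 2 * lg)]
```

For m = 16 a pattern of length 12 yields four pairs, with tails of length 4, 5, 6 and 7.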

The overall idea is to insert all heads into a data structure so that we can
find potential matches in the stream efficiently. We will only look for potential
matches every log m arrivals. We use the remaining at least log m arrivals before
a full match can occur both to de-amortise the cost of finding head matches
and to check if the relevant tails match as well.

In order to look for matches with heads of patterns efficiently we will use a
slight modification of the probabilistic z-fast trie introduced by Belazzougui
et al. [2] (Theorem 4.1 of [2]). A z-fast trie is a randomised data structure
which compactly represents a trie on a set of strings. Our modification to the
probabilistic z-fast trie simply uses a different signature function. For a string
s = s_1 ... s_k we define it to be φ(s_k ... s_1), the fingerprint of the reverse of s.
Otherwise the data structure remains unchanged.

An important concept in this data structure is the exit node of a string x.
This is the deepest node labelled by a prefix of x. Given a string x and signatures
of all its prefixes, we can find the exit node of x using the z-fast trie in
O(log m + log log(k + m)) time, where m is the maximal length of the strings.
Importantly, the lookup procedure compares at most log m + log log(k + m) pairs
of signatures. Each comparison errs with probability at most 1/n^2, so by a union
bound the probability of a false match is at most (log m + log log(k + m))/n^2.
When there are no false positives in signature comparisons, correctness and the
time bound are guaranteed by Lemma 4.2 and Lemma 4.3 of [2].

We can now describe Algorithm A_1, assuming that all patterns are longer
than 2 log m but no longer than 2k log m. As a preprocessing step, we build the
probabilistic z-fast trie for the reverses of the at most k log m heads. For regularly
spaced indices of the text, we will use the z-fast trie to find the longest head
that matches at each of these locations.

We will also augment the z-fast trie in the following way. We mark each
node labelled by a head with a colour representing the fingerprint of the
corresponding tail. In the end, each node may be marked by several colours, and
the total number of colours will be at most k log m. On top of the z-fast trie we
build a coloured-ancestor data structure [?]. This occupies O(k log m) space and
supports Find(u, c) queries in O(log log(k log m)) = O(log log(k + m)) time, where
Find(u, c) is the lowest ancestor of a node u marked with colour c. Each pattern
consists of one head concatenated to its corresponding tail and so we will use
coloured-ancestor queries to find the longest whole pattern matches by using the
fingerprints of different tails as queries.
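
To fix the semantics of Find(u, c), here is a deliberately naive Python sketch that walks up parent pointers. It does not achieve the O(log log) query time of the coloured-ancestor structure; it only illustrates what a query returns. The class name and the convention that u counts as its own ancestor are assumptions made for this illustration.

```python
class ColouredTree:
    """Rooted tree with a set of colours on each node and a naive Find(u, c)."""

    def __init__(self):
        self.parent = {}
        self.colours = {}

    def add_node(self, node, parent=None, colours=()):
        self.parent[node] = parent
        self.colours[node] = set(colours)

    def find(self, u, c):
        """Lowest ancestor of u (u included) marked with colour c, else None."""
        while u is not None:
            if c in self.colours[u]:
                return u
            u = self.parent[u]
        return None
```

In Algorithm A_1 the node u would be the exit node of the reversed text and c the fingerprint of a candidate tail, so a defined answer corresponds to a head whose tail also matches.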

At all times we maintain a circular buffer of size 2k log m which holds the
fingerprints of the most recent 2k log m prefixes of the text. Let i be an
integer multiple of log m. For each such i, we query the z-fast trie with the string
x = t_i ... t_{i-2k log m+1}. Note that for each prefix of x we can compute its signature
in O(1) time with the help of the buffer. The query returns the exit node e(x)
of x in O(log m + log log(k + m)) time, which is used to analyse arrivals in the
interval [i + log m, i + 2 log m]. This exit node corresponds to the longest head
that matches ending at index i. The O(log m) cost of performing the query is
de-amortised during the interval (i, i + log m].
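
The circular buffer of prefix fingerprints can be sketched as below. The sketch assumes the fingerprint φ(S) = Σ_{i=1}^{ℓ} s_i r^i mod p and additionally stores the inverse power r^{-j} next to each prefix fingerprint, so that the fingerprint of any recent window follows from the third identity of Lemma 1 in constant time. The class and constant names are illustrative, and the signature used by the trie (the fingerprint of a reversed string) would need a further adjustment not shown here.

```python
PRIME = (1 << 61) - 1   # assumed modulus
R = 1_000_003           # stand-in for the random r


def phi(s):
    """Direct evaluation of phi(S) = sum_i s_i * R^i mod PRIME (1-based i)."""
    h, power = 0, 1
    for ch in s:
        power = power * R % PRIME
        h = (h + ord(ch) * power) % PRIME
    return h


class PrefixFingerprintBuffer:
    def __init__(self, capacity):
        self.cap = capacity
        self.n = 0                        # characters seen so far
        self.power = 1                    # R^n
        self.inv_power = 1                # R^(-n)
        self.r_inv = pow(R, -1, PRIME)
        # slot j % cap holds (phi(t_1..t_j), R^(-j)) for the most recent j
        self.slots = [(0, 1)] * capacity

    def push(self, ch):
        prev = self.slots[self.n % self.cap][0]
        self.n += 1
        self.power = self.power * R % PRIME
        self.inv_power = self.inv_power * self.r_inv % PRIME
        entry = ((prev + ord(ch) * self.power) % PRIME, self.inv_power)
        self.slots[self.n % self.cap] = entry

    def window(self, length):
        """Fingerprint of the last `length` characters, for 0 < length < capacity."""
        j = self.n - length
        if j < 0 or not 0 < length < self.cap:
            raise ValueError("window is not covered by the buffer")
        phi_n = self.slots[self.n % self.cap][0]
        phi_j, inv_j = self.slots[j % self.cap]
        return (phi_n - phi_j) * inv_j % PRIME
```

A buffer with capacity c answers windows of length up to c - 1, because the prefix ending just before the window must still be stored.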

For each arrival t_ℓ, ℓ ∈ [i + log m, i + 2 log m], we compute the fingerprint φ
of t_{i+1} ... t_ℓ. This can be done in constant time as we store the last 2k log m ≥
m ≥ 2 log m fingerprints. If Find(e(x), φ) is defined, ℓ is an endpoint of a whole
pattern match and we report it. Otherwise, we proceed to the next arrival.
The overall time per arrival is therefore dominated by the time to perform the
coloured-ancestor queries, which is O(log log(k + m)).

We remark that the algorithm can also be applied to patterns of maximal
length 4k log m and the time complexity will be unchanged. Moreover, if there
are several possible patterns that match for a given arrival, the algorithm reports
the longest such pattern. These two properties will be needed when we describe
Algorithm A_2b in Section 3.2.


3 Long patterns 

We now assume that all the patterns have length greater than 2k log m. We
distinguish two cases according to the periodicity of those patterns: those with
short period and those with long period. Hereafter, to distinguish the cases, we
use the following notation. Let m_i = |P_i| and Q_i be the prefix of P_i such that
|Q_i| = m_i - k log m. Let ρ_{Q_i} be the period of Q_i. The remaining patterns are
then partitioned into two disjoint groups, those with ρ_{Q_i} < k log m
and those with ρ_{Q_i} ≥ k log m. We describe two algorithms, A_2a and A_2b, one
for each case respectively. The overall solution is then to run all three
algorithms A_1, A_2a, A_2b simultaneously to obtain Theorem 1.


3.1 Algorithm A_2a: Patterns with short periods


This section gives an algorithm for a dictionary of patterns 𝒫 = P_1, ..., P_k such
that m_i > 2k log m and ρ_{Q_i} < k log m. Recall that Q_i is the prefix of P_i of length
m_i - k log m and ρ_{Q_i} is the period of Q_i. The overall idea for this case is that
if we can find enough repeated occurrences of the period of a pattern then we
know we have almost found a full pattern match. As the pattern may end with
a partial copy of its period we will have to handle this part separately. The main
technical hurdle we overcome is how to process different patterns with different
length periods in an efficient manner.

We define the tail of a pattern P_i to be its suffix of length 2k log m. Observe
that a P_i match occurs if and only if there is a match of Q_i followed by a match
with the tail of P_i.

Let K_i be the prefix of Q_i of length k log m. Further observe that Q_i can only
match if there is a sequence of ⌊(|Q_i| - |K_i|)/ρ_{Q_i}⌋ + 1 occurrences of K_i in the
text, each occurring exactly ρ_{Q_i} characters after the last. This follows
immediately from the fact that K_i has length k log m and Q_i has period
ρ_{Q_i} < k log m.

We now describe algorithm A_2a, which solves this case. At all times we maintain
a circular buffer of size 2k log m which holds the fingerprints of the most
recent 2k log m prefixes of the text. That is, if the last arrival is t_ℓ, then the
buffer contains the fingerprints φ(t_1 ... t_{ℓ-2k log m+1}), ..., φ(t_1 ... t_ℓ).

To find K_i matches, we store the fingerprint φ(K_j) of each distinct K_j in a
static perfect hash table. By looking up φ(t_{ℓ-k log m+1} ... t_ℓ) we can find whether
some K_i matches in O(1) time. For each distinct K_j we maintain a list of recent
matches stored as an arithmetic progression. Each time we find a new match
with K_j we check whether it is exactly ρ_{Q_j} characters from the last match. If
so we include it in the current arithmetic progression. If not, then we delete
the current progression and start a new progression containing only the latest
match. Note that K_i = K_j implies that ρ_{Q_i} = ρ_{Q_j} and therefore there is no
ambiguity in the description.
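
The bookkeeping for one K_j can be sketched as a three-field record. The class name and the representation (first position, last position, count) are assumptions made for illustration; any constant-size encoding of an arithmetic progression would do.

```python
class MatchProgression:
    """Recent matches of one K_j as an arithmetic progression with step `period`."""

    def __init__(self, period):
        self.period = period
        self.first = None
        self.last = None
        self.count = 0

    def add(self, pos):
        """Record a new match at position pos in O(1) time."""
        if self.count and pos - self.last == self.period:
            self.last = pos              # extends the current progression
            self.count += 1
        else:
            self.first = self.last = pos  # discard and start afresh
            self.count = 1
```

With period 3, matches at positions 5, 8 and 11 give a progression of three occurrences, while a following match at position 15 resets it to a single occurrence.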

We store the fingerprint of each tail in another static perfect hash table.
For each arrival t_ℓ we use this hash table to check whether φ(t_{ℓ-2k log m+1} ... t_ℓ)
matches the fingerprint of some tail. This takes O(1) time per arrival.

Assume that the tail of some P_i matched. We will justify below that we
can assume that each tail corresponds to a unique P_i. It remains to decide
whether this is in fact a full match with P_i. This is determined by a simple
check: whether the current arithmetic progression for K_i contains at
least ⌊(|Q_i| - |K_i|)/ρ_{Q_i}⌋ occurrences.


Lemma 3. Algorithm A_2a takes O(1) time per character and uses O(k log m)
space.


Proof. The algorithm stores two hash tables, each containing O(k log m)
fingerprints, as well as O(k) arithmetic progressions. The total space is therefore
O(k log m) as claimed. The time complexity of O(1) per character follows by the
use of static perfect hash tables (which are precomputed and depend only on 𝒫).

We first prove the claim that each tail corresponds to a unique P_i. To this
end, we assume in this section that no pattern contains another pattern as a
suffix. In particular, any such pattern can be deleted from the dictionary during
the preprocessing stage as it does not change the output. This implies the claim
that each P_i has a distinct tail because the tail contains a full period of P_i.

The correctness follows almost immediately from the algorithm description
via the observation that each Q_i is formed from ⌊(|Q_i| - |K_i|)/ρ_{Q_i}⌋ + 1 repeats
of K_i followed by a prefix of K_i. We check explicitly whether there are sufficient
repeats of K_i in the text stream to imply a Q_i match. While we do not check
explicitly that either the final prefix of K_i is a match or that the full P_i matches,
this is implied by the tail match. This is because the tail has length 2k log m and
hence includes the final prefix of K_i and the last k log m characters of P_i (those
in P_i but not in Q_i). □


3.2 Algorithm A_2b: Patterns with long periods

Consider a dictionary 𝒫 in which the patterns are such that m_i > 2k log m
and ρ_{Q_i} ≥ k log m. Let us define k to be the number of strings in this dictionary.
We can now describe Algorithm A_2b. Recall that Q_i is the prefix of P_i s.t.
|Q_i| = m_i - k log m. For each pattern P_i, we define P_{i,j} to be the prefix of P_i
with length 2^j, 1 ≤ 2^j ≤ m_i - 2k log m.

We will first give an overview of an algorithm that identifies P_{i,j} matches in
O(log m) time per arrival. With the help of A_1 and A_2a we will speed it up to
achieve an algorithm with O(log log(k + m)) time per arrival. The algorithm will
identify the matches with a small delay of up to k log m arrivals. We then show
how to extend P_{i,j} matches to Q_i matches. This stage will still report the matches
after they occur. Finally we show how to find whole pattern matches in the stream
using the Q_i matches while also completely eliminating the delay in the reporting
of these matches. In other words, any matches for whole patterns will be reported
as soon as they occur and before the next arrival in the stream as desired.


O(log m)-time algorithm. We define a logarithmic number of levels. Level
j will represent all the matches for prefixes P_{i,j}. We store only active prefix
matches, that is, those that still have the potential to indicate the start of full
matches of a pattern in the dictionary. This means that any match at level j whose
position is more than 2^{j+1} from the current position of an arrival is simply
removed. We will use the following well-known fact.

Fact 2 (Lemma 3.2 of [3]). If there are at least three matches of a string U of
length 2^j in a string V of length 2^{j+1}, then the positions of all matches of U in V
form an arithmetic progression. The difference of the progression is equal to the
length of the period of U.
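
Fact 2 is easy to check by brute force on small strings. The following Python sketch uses naive helper functions (all names are illustrative) to list the occurrences of U in V and to test whether they form an arithmetic progression whose difference is the period of U.

```python
def occurrences(u, v):
    """0-based start positions of every occurrence of u in v."""
    return [i for i in range(len(v) - len(u) + 1) if v.startswith(u, i)]


def smallest_period(u):
    """Smallest p > 0 with u[i] == u[i + p] wherever both positions exist."""
    n = len(u)
    return next(p for p in range(1, n + 1)
                if all(u[i] == u[i + p] for i in range(n - p)))


def is_progression(positions, step):
    """True if consecutive positions always differ by exactly `step`."""
    return all(b - a == step for a, b in zip(positions, positions[1:]))
```

For U = abab and V = abababab the occurrences start at 0, 2 and 4, and the period of U is 2.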

It follows that if there are at least three active matches for the same prefix at 
the same level, we can compactly store them as a progression in constant space. 
Consider a set of distinct prefixes of length 2^j of the patterns in 𝒫. For each of
them we store a progression that contains:

(1) The position fp of the first match;

(2) The fingerprint of t_1 ... t_fp;

(3) The fingerprint of the period ρ of the prefix;

(4) The length of the period ρ of the prefix;

(5) The position lp of the last match. 

With this information, we can deduce the position and the fingerprint of the 
text from the start to the position of any active match of the prefix. Moreover, 
we can add a new match or delete the first match in a progression in 0(1) time. 
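
The five stored items can be sketched as a small record. Several conventions here are assumptions rather than details given in the text: positions are taken to be 1-based end positions of matches, item (3) is taken to be the fingerprint of the block of characters separating two consecutive matches, and the fingerprint is φ(S) = Σ_{i=1}^{ℓ} s_i r^i mod p. All names are illustrative.

```python
from dataclasses import dataclass

PRIME = (1 << 61) - 1   # assumed modulus
R = 1_000_003           # stand-in for the random r


def phi(s):
    """phi(S) = sum over 1-based i of s_i * R^i, taken mod PRIME."""
    h, power = 0, 1
    for ch in s:
        power = power * R % PRIME
        h = (h + ord(ch) * power) % PRIME
    return h


@dataclass
class Progression:
    fp: int           # (1) position of the first match
    fp_phi: int       # (2) fingerprint of t_1 ... t_fp
    period_phi: int   # (3) fingerprint of the block between consecutive matches
    period: int       # (4) length of the period
    lp: int           # (5) position of the last match

    def size(self):
        return (self.lp - self.fp) // self.period + 1

    def add_match(self, pos):
        if pos != self.lp + self.period:
            raise ValueError("match does not extend the progression")
        self.lp = pos

    def delete_first(self):
        if self.fp == self.lp:
            raise ValueError("cannot drop the only match")
        # Lemma 1: phi(t_1..t_{fp+period}) = phi(t_1..t_fp) + R^fp * phi(block)
        shift = pow(R, self.fp, PRIME)   # cache R^fp to make this step O(1)
        self.fp_phi = (self.fp_phi + shift * self.period_phi) % PRIME
        self.fp += self.period
```

Deleting the first match advances both the position and the stored text fingerprint by one period, which is what allows the fingerprint of the text up to any active match to be recovered.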

We make use of a perfect hash table H that stores the fingerprints of all the
prefixes of the patterns in 𝒫. The keys of H correspond to the fingerprints of
all the prefixes and the associated value indicates whether the prefix from which
the key was obtained is a proper prefix of some pattern, a whole pattern itself,
or both. Using the construction of [?], for example, the total space needed to
store all the fingerprints and their corresponding values is O(k log m).

When a character t_ℓ of the text arrives, we update the current position and
the fingerprint of the current text. The algorithm then works through the
progressions over the log m levels. We start at level 0. If the fingerprint φ(t_ℓ) is
in H, we insert a new match to the corresponding progression at level 0.

For each level j from 0 to log m, we retrieve the position p of the first match
at level j. If p is at distance 2^{j+1} from t_ℓ, we delete the match and check if the
fingerprint φ(t_p ... t_ℓ) is in H. If it is and the fingerprint is a fingerprint of one of
the patterns, we report a match (ending at t_ℓ, the current position of the text).
If the fingerprint is in H and it is a fingerprint of a proper prefix, then p is
a plausible position of a match of a prefix of length 2^{j+1}. We check if it fits in
the appropriate progression π at level j + 1 (which might not be true if the
fingerprints collided). If it does, we insert p into π. If p does not fit in π, we
discard it and proceed to the next level.

As updating progressions at each level takes only O(1) time, and there are
log m levels, the time complexity of the algorithm is O(log m) per arrival. The
space complexity is O(k log m). We deliberately omit some details (for example,
how to retrieve the position of the first match in the level) as they will not be 
important for the final algorithm. 

O(log log(k + m))-time algorithm. We will follow the same level-based idea.
To speed up the algorithm, we will consider prefixes P_{i,j} with short and long
periods separately. The number of matches of the prefixes with short periods
can be big, but we will be able to compute them fast with the help of A_1 and
A_2a. On the other hand, matches of the prefixes with long periods are rare, and
we will be able to compute them in a round robin fashion.

Let ρ_{i,j} be the period of P_{i,j}. We first build a dictionary D_1 containing at most
one prefix for each P_i: specifically, the largest P_{i,j} with period
ρ_{i,j} < k log m and 2k log m ≤ |P_{i,j}| ≤ m_i - 2k log m. If no such P_{i,j} exists we do
not insert a prefix for P_i. This dictionary is processed using a modification of
algorithm A_2a, which we described in Section 3.1. The modification is that when a
text character t_ℓ arrives, the output of the algorithm identifies the longest pattern
in D_1 which matches ending at t_ℓ, or 'no match' if no pattern matches. This is in
contrast to A_2a as described previously, where we only output whether some
pattern matches. The modification takes advantage of the fact that prefixes in
D_1 all have power-of-two lengths and uses a simple binary search approach over
the O(log m) distinct pattern lengths. This increases the run-time of A_2a to
O(log log m) time per arrival. The details can be found in the appendix.

Whenever a match is found with some pattern in D_1, we update the match
progression of the reported pattern (but not of any of its suffixes that might be
in D_1). Importantly, we will still have at most two progressions of active matches
per prefix because of the following lemma and corollary.

Lemma 4. Let P_{i,j}, P_{i',j'} be two prefixes in D_1 and suppose that P_{i,j} is a suffix
of P_{i',j'}. Then the periods of P_{i,j} and P_{i',j'} are equal.

Proof. Assume the contrary. Then P_{i,j} has two periods: ρ_{i,j} and ρ_{i',j'} (because
it is a suffix of P_{i',j'}). We have ρ_{i,j} + ρ_{i',j'} < 2k log m ≤ |P_{i,j}|. By the periodicity
lemma (see, e.g., [?]), ρ_{i',j'} is a multiple of ρ_{i,j}. But then P_{i',j'} is periodic with
period ρ_{i,j} < ρ_{i',j'}, a contradiction. □

Corollary 1. Let P_{i,j}, P_{i',j'}, and P_{i'',j''} be prefixes in D_1. Suppose that P_{i,j} is
a suffix of P_{i',j'} and simultaneously is a suffix of P_{i'',j''}. Then P_{i',j'} is a suffix
of P_{i'',j''} (or vice versa).

We now consider any P_i for which we did not find a suitable small period
prefix. In this case it is guaranteed that there is a prefix P_{i,j} with period
longer than k log m but length at most 4k log m. We build another dictionary D_2
containing these prefixes. We apply algorithm A_1 and for each arrival t_ℓ return
the longest prefix P_{i,j} in D_2 that matches at it in O(log log(k + m)) time. We then
need to update the match progression of P_{i,j} as well as the match progressions
of all P_{i',j'} ∈ D_2 that are suffixes of P_{i,j}. Fortunately, each of the prefixes in D_2
can match at most once in every k log m arrivals, because the period of each of
them is long, meaning that we can schedule the updates in a round robin fashion
to take O(1) time per arrival.

We denote the set of all P_{i,j} such that ρ_{i,j} ≥ k log m by S. Any of these
prefixes can have at most one match in k log m arrivals. Because of that and
because |S| ≤ k log m, we will be able to afford to update the matches in a
round robin fashion.

We will have two update processes running in parallel. The first process
will be updating matches of prefixes P_{i,j} ∈ S such that P_{i,j-1} ∈ S ∪ D_2.
We consider one of these prefixes per arrival. If there is a match with P_{i,j} in
[t_{ℓ-k log m}, t_ℓ] then there must be a corresponding match with P_{i,j-1} ending
in [t_{ℓ-2^{j-1}-k log m}, t_{ℓ-2^{j-1}}]. As P_{i,j-1} ∈ S, ρ_{i,j-1} ≥ k log m, so there is at most
one match. We can determine whether this match can be extended into a P_{i,j}
match using a single fingerprint comparison as described in the O(log m)-time
algorithm. This is facilitated by storing a circular buffer of the fingerprints of
the most recent k log m text prefixes.

The second process will be updating matches of prefixes P_{i,j} ∈ S such that
P_{i,j-1} ∈ D_1. Again, if there is a match with P_{i,j} in [t_{ℓ-k log m}, t_ℓ] then there
must be a corresponding match with P_{i,j-1} ending in [t_{ℓ-2^{j-1}-k log m}, t_{ℓ-2^{j-1}}].

However, the second process will be more complicated for two reasons. First,
P_{i,j-1} has a small period so there could be many P_{i,j-1} matches ending in
this interval. Second, the information about P_{i,j-1} matches can be stored not
only in the progressions corresponding to P_{i,j-1}, but also in the progressions
corresponding to prefixes that have P_{i,j-1} as a suffix. The first difficulty can be
overcome because of the following lemma.

Lemma 5. Consider any P_{i,j} such that ρ_{i,j-1} < k log m ≤ ρ_{i,j}. Given a match
progression for P_{i,j-1}, only one match could also correspond to a match with
P_{i,j}.

Proof. Let U be the prefix of P_{i,j-1} of length ρ_{i,j-1}. That is, the substrings
bounded by consecutive matches in the match progression for P_{i,j-1} are equal to
U. Suppose that P_{i,j} starts with exactly r copies of U. Then we have P_{i,j} = U^r V
for some string V. Note that as ρ_{i,j-1} < k log m ≤ ρ_{i,j}, the string V cannot be
a prefix of U. Then the only match in the progression which could match with
P_{i,j} is the r-th last one. □

To overcome the second difficulty, we use Corollary 1. It implies that prefixes
in D_1 can be organized in chains based on the "being-a-suffix" relationship.
We consider prefixes in each chain in a round robin fashion again. We start at
the longest prefix, let it be P_{i,j}. At each moment we store exactly one
progression, initialized to the progression of P_{i,j}. If the progression intersects with
[t_{ℓ-2^{j-1}-k log m}, t_{ℓ-2^{j-1}}], we identify the 'interesting' match in O(1) time with
the help of Lemma 5 and try to extend it as in the first process. We then proceed
to the second longest prefix P_{i',j'}. If the stored progression intersects with
[t_{ℓ-2^{j'-1}-k log m}, t_{ℓ-2^{j'-1}}], we proceed as for P_{i,j}. Otherwise, we update the
progression to be the progression of P_{i',j'} and repeat the previous steps for it. We
continue this process for all prefixes in the chain.

From the description of the processes it follows that the matches for each P_{i,j}
(in particular, for the longest P_{i,j} for each i) are output in O(log log(k + m))
time per arrival with a delay of up to k log m characters (i.e. at most k log m
characters after they occur).


Finding Q_i matches. We now show how to find Q_i matches using P_{i,j} matches.
If there is a match with Q_i in [t_{ℓ-k log m}, t_ℓ], there must be a match with the
longest P_{i,j} in [t_{ℓ-2^j-k log m}, t_{ℓ-2^j}]. Because |P_{i,j}| ≤ m_i - 2k log m, this match
has been identified by the algorithm and it is the first match in the progressions.
We can determine whether this match can be extended into a Q_i match using a
single fingerprint comparison.

Therefore the Q_i matches are output in O(log log(k + m)) time with a delay
of up to k log m characters (i.e. at most k log m characters after they occur). We
can then remove this delay using coloured-ancestor queries in a similar manner
to algorithm A_1, as described below.


Finding whole pattern matches and removing the delay. Up to this
point, we have shown that we can find each Q_i match in O(log log(k + m)) time
per arrival with a delay of at most k log m characters. Further, we only report
one Q_i match at each time. We will show how to extend these Q_i matches into
P_i matches using coloured ancestor queries in O(log log(k + m)) time per arrival.

Build a compacted trie of the reverse of each string Q_i. The edge labels are
not stored. The space used is O(k). For each i we can find the reverse of Q_i in
the trie in O(1) time (by storing an O(k) space look-up table).

The tail of each P_i is its (k log m)-length suffix, i.e. the portion of P_i which
is not in Q_i. Each distinct tail is associated with a colour. As there are at most
k log m patterns, there are at most k log m colours. Computing the colour from
the tail is achieved using a standard combination of fingerprinting and static
perfect hashing. For each node in the trie which represents some Q_i we colour
the node with the colour of the tail of P_i.

Whenever we find a Q_i match, we identify the place in the trie where the
reverse of Q_i occurs. Recall that these matches may be found after a delay of
at most k log m characters. A Q_i match ending at position ℓ − k log m implies
a possible P_i match at position ℓ. We remember this potential match until t_ℓ
arrives.
More specifically, when t_ℓ arrives we determine the node u in the trie
representing the reverse of the longest Q_i which has a match at position ℓ − k log m.
This can be done in O(1) time by storing a circular buffer of fingerprints.

We now need to decide whether Q_i implies the existence of some P_j match.
It is important to observe that as we discarded all but the longest such Q_i, we
might find a P_j with j ≠ i.
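
To see why a pattern other than the one owning the longest matched prefix can be reported, the sketch below builds a trie of reversed prefixes and answers coloured ancestor queries by walking parent pointers. It is a deliberately naive illustration, not the structure the running-time analysis relies on: the trie is uncompacted, the colours are the tails themselves rather than fingerprints, and the names `Node`, `build_reverse_trie` and `find` are hypothetical.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}   # next character -> child Node
        self.colours = {}    # colour (here: the tail itself) -> full pattern


def build_reverse_trie(patterns, tail_len):
    """Insert the reverse of each prefix Q and colour its node with the tail of P."""
    root = Node()
    node_of_q = {}
    for p in patterns:
        q, tail = p[:-tail_len], p[-tail_len:]
        node = root
        for ch in reversed(q):
            if ch not in node.children:
                node.children[ch] = Node(node)
            node = node.children[ch]
        node.colours[tail] = p
        node_of_q[q] = node
    return root, node_of_q


def find(node, colour):
    """Naive coloured ancestor query: nearest node on the path to the root
    (starting at the node itself) that carries the given colour."""
    while node is not None:
        if colour in node.colours:
            return node.colours[colour]
        node = node.parent
    return None
```

With the patterns `abcabcabxy`, `bcabzz` and `cabxy` and tails of length 2, the node for the longest prefix `abcabcab` has the nodes for `bcab` and `cab` as ancestors, because those prefixes are suffixes of it. A query with colour `zz` from that node therefore returns `bcabzz`, even though the matched prefix belongs to `abcabcabxy`.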

For each arrival t_ℓ, we compute the fingerprint φ of t_{ℓ−k log m+1} ... t_ℓ. This can
be done in constant time as we store the last k log m fingerprints. If Find(u, φ)
is defined, t_ℓ is an endpoint of a pattern match and we report it. Otherwise, we
proceed to the next arrival.
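
The constant-time fingerprint of the most recent window can be sketched as follows. This is a minimal illustration assuming Karp-Rabin style polynomial fingerprints; the modulus `MOD`, the fixed `BASE` and the class `WindowFingerprint` are assumptions made for the example (a real implementation would choose the base at random), and a bounded `deque` stands in for the circular buffer.

```python
from collections import deque

MOD = (1 << 61) - 1   # a Mersenne prime, assumed here as the modulus
BASE = 911382323      # assumed fixed base; it should be drawn at random in practice


def fingerprint(s: str) -> int:
    """Polynomial fingerprint of a whole string, computed directly."""
    value = 0
    for ch in s:
        value = (value * BASE + ord(ch)) % MOD
    return value


class WindowFingerprint:
    """Fingerprint of the last `width` stream characters in O(1) time per arrival."""

    def __init__(self, width: int):
        self.width = width
        self.shift = pow(BASE, width, MOD)            # BASE**width mod MOD
        # Fingerprints of the most recent width + 1 stream prefixes.
        self.prefixes = deque([0], maxlen=width + 1)

    def push(self, ch: str):
        """Consume one character; return the window fingerprint,
        or None while fewer than `width` characters have arrived."""
        newest = (self.prefixes[-1] * BASE + ord(ch)) % MOD
        self.prefixes.append(newest)
        if len(self.prefixes) <= self.width:
            return None
        oldest = self.prefixes[0]
        return (newest - oldest * self.shift) % MOD
```

Each call to `push` performs a constant number of arithmetic operations, and the buffer never holds more than `width + 1` values, which corresponds to storing the last k log m fingerprints.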

Lemma 6. Algorithm Ahh takes O(log log(k + m)) time per character. The space
complexity of the algorithm is O(k log m).

References 

1. Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. 
Communications of the ACM 18(8), 333-340 (1975) 

2. Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Monotone minimal perfect hashing:
Searching a sorted table with O(1) accesses. In: SODA ’09: Proc. 20th ACM-SIAM
Symp. on Discrete Algorithms, pp. 785-794 (2009)

3. Breslauer, D., Galil, Z.: Real-time streaming string-matching. ACM Transactions 
on Algorithms 10(4), 22 (2014) 

4. Breslauer, D., Grossi, R., Mignosi, F.: Simple real-time constant-space string
matching. In: CPM ’11: Proc. 22nd Annual Symp. on Combinatorial Pattern
Matching, pp. 173-183 (2011)

5. Broder, A.Z., Mitzenmacher, M.: Network applications of Bloom filters: A
survey. Internet Mathematics 1(4), 485-509 (2003)

6. Clifford, R., Jalsenius, M., Porat, E., Sach, B.: Pattern matching in multiple
streams. In: CPM ’12: Proc. 23rd Annual Symp. on Combinatorial Pattern
Matching, pp. 97-109 (2012)

7. Clifford, R., Sach, B.: Pseudo-realtime pattern matching: Closing the gap. In: CPM
’10: Proc. 21st Annual Symp. on Combinatorial Pattern Matching, pp. 101-111
(2010)

8. Clifford, R., Sach, B.: Pattern matching in pseudo real-time. Journal of Discrete 
Algorithms 9(1), 67-81 (2011) 

9. Crochemore, M., Perrin, D.: Two-way string matching. Journal of the ACM 38(3), 
651-675 (1991) 

10. Crouch, M.S., McGregor, A.: Periodicity and cyclic shifts via linear sketches. In: 
Approximation, Randomization, and Combinatorial Optimization. Algorithms and 
Techniques, pp. 158-170. Springer (2011) 

11. Ergun, F., Jowhari, H., Saglam, M.: Periodicity in streams. In: RANDOM ’10:
Proc. 14th Intl. Workshop on Randomization and Computation, pp. 545-559 (2010)

12. Jalsenius, M., Porat, B., Sach, B.: Parameterized matching in the streaming model.
In: STACS ’13: Proc. 30th Annual Symp. on Theoretical Aspects of Computer
Science, pp. 400-411 (2013)

13. Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM
Journal of Research and Development 31(2), 249-260 (1987)





14. Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM
Journal on Computing 6, 323-350 (1977)

15. Lothaire, M.: Algebraic Combinatorics on Words. Cambridge University Press 
(2002), Cambridge Books Online 

16. Muthukrishnan, S., Muller, M.: Time and space efficient method-lookup for
object-oriented programs. In: SODA ’96: Proc. 7th ACM-SIAM Symp. on Discrete
Algorithms, pp. 42-51 (1996)

17. Porat, B., Porat, E.: Exact and approximate pattern matching in the streaming
model. In: FOCS ’09: Proc. 50th Annual Symp. Foundations of Computer Science,
pp. 315-323 (2009)

18. Ruzic, M.: Constructing efficient dictionaries in close to sorting time. In: ICALP
’08: Proc. 35th International Colloquium on Automata, Languages and
Programming, pp. 84-95 (2008)

19. Slater, G., Birney, E.: Automated generation of heuristics for biological sequence
comparison. BMC Bioinformatics 6(31), 1-11 (2005)

20. Tuck, N., Sherwood, T., Calder, B., Varghese, G.: Deterministic memory-efficient
string matching algorithms for intrusion detection. In: Proceedings IEEE
INFOCOM 2004, The 23rd Annual Joint Conference of the IEEE Computer and
Communications Societies, Hong Kong, China, March 7-11, 2004. IEEE (2004)





A Suffixes, powers-of-two and the longest match 

In Section 3.2 we will use algorithm _4-2 Q as a black box. However, we will need
the output to determine the longest pattern that matches when each new text
character arrives rather than simply whether a pattern matches. Furthermore,
we will not be able to guarantee (as is safely assumed above) that no pattern is
a prefix of another. Fortunately the patterns will all have a power-of-two length.
We now briefly describe the required changes which increase the running time
from O(1) to O(log log m).

The changes do not affect the algorithm until the point at which some tail
has been matched. As one pattern could be a suffix of another, O(log m) patterns
could have the same tail. This follows from the fact that the tail contains a full
period of any pattern P_i and that all patterns have power-of-two lengths.

Whenever a tail is matched when some t_ℓ arrives, we need to determine the
longest matching P_i with this tail. Assume, as a motivating special case, that
every P_i with this tail has the same K_i. As above, P_i is associated with a number
of occurrences,

\[ c_i = \left\lfloor \frac{|Q_i| - |K_i|}{\rho_{Q_i}} \right\rfloor + 1 \]

of K_i that are required for a P_i match. The basic idea is to perform binary
search on the set of c_i values (for the P_i's with the matching tail) using the number
of occurrences of K_i in the current arithmetic progression as the key. As there
are at most O(log m) candidates, this takes O(log log m) time.
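
The special case in which all candidates share one K_i can be sketched with a standard binary search. The code below is an illustration under stated assumptions: the function names `required_occurrences` and `longest_candidate` are hypothetical, the thresholds are assumed to be precomputed and sorted in increasing order (longer patterns need more occurrences), and the number of occurrences in the current progression is taken as given.

```python
from bisect import bisect_right


def required_occurrences(q_len: int, k_len: int, period: int) -> int:
    """Occurrences of a length-k_len prefix in a string of length q_len
    with the given period: floor((q_len - k_len) / period) + 1."""
    return (q_len - k_len) // period + 1


def longest_candidate(thresholds, pattern_ids, occurrences):
    """thresholds[x] is the requirement of pattern_ids[x], sorted ascending.
    Return the pattern with the largest requirement <= occurrences, or None."""
    pos = bisect_right(thresholds, occurrences)
    return pattern_ids[pos - 1] if pos else None
```

As a usage example with made-up parameters, take patterns of lengths 16, 32 and 64 with tails of length 4, a shared prefix of length 6 and period 3. The thresholds are then 3, 8 and 19, so a progression with 10 occurrences selects the pattern of length 32, while one with only 2 occurrences selects none.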

However, two patterns P_i and P_j with the same tail could have K_i ≠ K_j.
Fortunately, Lemma 7 below says that using the ‘wrong’ K_i only affects the
number of required matches by at most 1. For each tail, we (arbitrarily) preselect
a single K_j among the patterns with this tail. We then perform the same binary
search using K_j. As the O(log m) candidates have power-of-two length (greater than
2k log m), for any two patterns P_i ≠ P_j we have that |c_i − c_j| > 4. Therefore, we
find at most one candidate P_j, which is then checked using its own K_j.

Lemma 7. Let P_i and P_j be two patterns with the same tail but K_i ≠ K_j. Let
us also assume that the tail of P_j matches when some t_ℓ arrives. P_i matches
ending at t_ℓ if the current arithmetic progression for K_j contains at least c_i + 1
occurrences. Furthermore, P_i does not match at t_ℓ if the same progression
contains fewer than c_i − 1 matches.

Proof. Let y_i be the number of matches of K_i in the current progression, and
define y_j analogously. The first thing to observe is that |y_i − y_j| ≤ 1. This follows from
the fact that |K_i| = |K_j|, they are both periodic and contain each other’s period
string.

Assume that y_j < c_i − 1. Therefore, as y_i ≤ y_j + 1, we have that y_i < c_i so
P_i does not match. Instead assume that y_j ≥ c_i + 1. Again, as y_i ≥ y_j − 1, we
have that y_i ≥ c_i. □